Abstract

We present a nonparametric Bayesian 
method for disease subtype discovery in 
multi-dimensional cancer data. Our method 
can simultaneously analyse a wide range 
of data types, allowing for both agreement 
and disagreement between their underlying 
clustering structure. It includes feature 
selection and infers the most likely number 
of disease subtypes, given the data. 

We apply the method to 277 glioblastoma 
samples from The Cancer Genome Atlas, for 
which there are gene expression, copy number 
variation, methylation and microRNA data. 
We identify 8 distinct consensus subtypes
and study their prognostic value for death,
new tumour events, progression and recurrence.
The consensus subtypes are prognostic of tumour
recurrence (log-rank p-value of $3.6 \times 10^{-4}$
after correction for multiple hypothesis tests).
This is driven principally by the methylation data
(log-rank p-value of $2.0 \times 10^{-3}$), but the
effect is strengthened by the other 3 data types,
demonstrating the value of integrating multiple
data types.

Of particular note is a subtype of 47 patients
characterised by very low levels of methylation.
This subtype has very low rates of tumour recurrence
and no new events in 10 years of follow up. We also
identify a small gene expression subtype of 6 patients
that shows particularly poor survival outcomes.
Additionally, we note a consensus subtype that shows
a highly distinctive data signature and suggest that
it is therefore a biologically distinct subtype of
glioblastoma.

Presented at the 29th International Conference on Machine
Learning, Edinburgh, Scotland, UK, 2012: Workshop on Machine
Learning in Genetics and Genomics. Copyright 2012 by the
author(s)/owner(s).

We note that while the consensus subtypes 
are highly informative, there is only par- 
tial overlap between the different data types. 
This suggests that when considering multi- 
dimensional cancer data, the underlying biol- 
ogy is more complex than a straightforward 
set of well-defined subtypes. We suggest that 
this may be a key consideration when mod- 
eling such data. 

The code is available from 

https://sites.google.com/site/multipledatafusion/ 

1. Introduction 

Cancer is a complex disease, driven by a range of genetic
and environmental effects. It is responsible for 1 in 8
deaths worldwide (ACS, 2011), with an estimated 7.6 million
cancer deaths worldwide in 2008 (Jemal et al., 2011).
Understanding the cancer genome (Stratton et al., 2009) and
the associated molecular mechanisms is therefore a vitally
important global medical issue.

Modern large-scale cancer studies present great new
opportunities to understand different types of cancer
and their underlying mechanisms. Projects such as
The Cancer Genome Atlas (TCGA)
(http://cancergenome.nih.gov/), METABRIC
(Curtis et al., 2012) and the International Cancer
Genome Consortium (ICGC, 2010) are producing
large, multi-dimensional data sets that have the
potential to revolutionise the study of cancer.

One of the first TCGA projects is a study of glioblastoma
(TCGA, 2008), the most common primary brain tumour in
human adults. Glioblastoma is an aggressive cancer;
patients with newly diagnosed glioblastoma have a median
survival of approximately 1 year. The TCGA glioblastoma
data set is a hugely relevant resource for improving
this situation.

Key to the utilisation of these multi-dimensional data
sets is the development of effective data fusion methods
(see e.g. Shen et al., 2009; Savage et al., 2010; Yuan
et al., 2011; Kirk et al., 2012). It is not enough to
simply concatenate the different data types; one must
account for the different statistical characteristics of
each data type, and for the possibility that they contain
differing or even contradictory information about the
samples studied. There is potentially huge benefit to
proper analysis of such multi-dimensional data sets
(consider, for example, how the number of pairwise data
type comparisons, K(K - 1)/2 for K types, grows as the
total number of types increases). To fully realise this
benefit, we must develop new methods.

The Multiple Data Integration (MDI) algorithm (Kirk 
et al., 2012) is a principled framework for the iden- 
tification of cancer subtypes. It can analyse multi- 
dimensional data sets, combining a range of individual 
data types such as gene expression, copy number varia- 
tion, methylation and microRNA data. MDI can be re- 
garded as the extension to multiple (possibly disagree- 
ing) data types of nonparametric Bayesian clustering 
methods such as the Dirichlet Process (DP) mixture 
model (see e.g. Ferguson, 1973; Antoniak, 1974; Esco- 
bar & West, 1995; Dahl, 2006; Rasmussen et al., 2007). 

The key advantage of MDI is that it allows for the pos- 
sibility of both agreement and disagreement between 
the clustering structures of different data types within 
a given analysis. This is extremely important in bi- 
ological data sets. For example, gene expression can 
be regulated by a number of biological mechanisms, 
so is not determined solely by the underlying genome. 
Hence integrating gene expression and copy number
variation data might or might not result in good
agreement, depending on the biological context.

MDI produces clustering partitions for each data type, 
as well as an overall consensus clustering partition. It 
also identifies the degree to which different data types 
share common structure, and can identify which of the 
items are fused across the different data types. 

To extend the MDI method to the analysis of cancer
data, we have added functionality beyond that of
Kirk et al. (2012). Two data models (Gaussian
and Multinomial) are used. For each of these data
models, feature selection has been added, so that the 
most informative features can be identified for each 
data type. Additionally, it is known that the MCMC 
chains in mixture-based clustering methods can be 
slow to mix when using a Gibbs sampler. To improve 
performance in this regard, an additional split-merge 
MCMC sampler has been added, which is used in con- 
junction with the existing Gibbs steps. 

MDI therefore has the following advantages in
analysing post-genomic molecular cancer data.

• Infers (rather than assumes) the degree to which
clustering structure is shared between data types

• Infers the likely number of clusters given the data

• Identifies the genes/probes in each data type that
define the disease subtypes

• Integrates simultaneously a wide range of data
types (4 in this paper; more can easily be included
if available)

The rest of the paper is organised as follows. In
Section 2 we describe the data set. In Section 3 we 
describe MDI and present several improvements to 
the method. In Section 4 we present the results of 
analysing the TCGA glioblastoma data. Finally, in 
Section 5 we draw conclusions about this work. 

2. Data 

We downloaded glioblastoma data (TCGA, 
2008) from the TCGA data portal 
(http://cancergenome.nih.gov/), including gene 
expression, copy number variation, methylation 
and microRNA data, as well as clinical follow-up 
information. After matching samples across all 4 data 
types, we are left with 277 samples for which we have 
complete data. We note that in a few cases (and for
a given data type) there are duplicate samples for the
same patient. In these cases we make a blind selection
of the first sample, based on bar code ordering.
All data were downloaded from the TCGA data portal
on lath April 2012. 
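The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that the patient-level identifier is the first three hyphen-separated fields of a TCGA bar code, that "bar code ordering" means lexicographic order, and the function names and toy bar codes are hypothetical.

```python
def patient_id(barcode):
    """Return the patient-level prefix of a bar code (first three fields)."""
    return "-".join(barcode.split("-")[:3])


def first_sample_per_patient(barcodes):
    """Keep one bar code per patient: the first in sorted bar code order."""
    chosen = {}
    for barcode in sorted(barcodes):
        chosen.setdefault(patient_id(barcode), barcode)
    return chosen


def match_patients(samples_by_type):
    """Return the sorted patient ids that are present in every data type."""
    per_type = [set(first_sample_per_patient(b)) for b in samples_by_type.values()]
    return sorted(set.intersection(*per_type))


# Toy bar codes, made up for illustration (not real TCGA samples).
toy = {
    "expression": ["TCGA-02-0001-01C-01R", "TCGA-02-0001-01A-01R", "TCGA-02-0003-01A-01R"],
    "methylation": ["TCGA-02-0001-01A-01D", "TCGA-02-0007-01A-01D"],
}
matched = match_patients(toy)
chosen_expression = first_sample_per_patient(toy["expression"])
```

Only patients present in every data type survive the intersection, which is how a set of complete cases such as the 277 samples used here would arise.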

2.1. Gene Expression 

We use the publicly-available level 3 gene expression
data. For consistency, data were chosen from a single
platform. The UNC AgilentG4502A_07 samples were
selected as they were most numerous, giving a total
of 571 tumour samples. These were read into a single
data matrix and NaN (missing) values set to zero.

The data include 10 normal samples. For each gene, 
a Wilcoxon rank-sum test was used to determine 
whether or not there was differential expression be- 
tween tumour and normal samples. A Bonferroni 
correction and p-value threshold was applied {p < 
2 X 10~'^), leaving 1011 gene expression features. 
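The filtering step above (a per-feature rank-sum test followed by a Bonferroni correction and a threshold) can be sketched as below. This is an illustrative standard-library version, not the authors' code: it uses the normal approximation to the rank-sum statistic without a tie correction to the variance, and the function names, toy data and the 0.05 threshold are assumptions.

```python
import math


def rank_sum_p_value(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks for ties)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2.0 + 1.0  # average 1-based rank of the tie group
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_features(tumour, normal, threshold):
    """Indices of features whose Bonferroni-corrected p-value is below threshold."""
    n_features = len(tumour)
    keep = []
    for idx in range(n_features):
        p = rank_sum_p_value(tumour[idx], normal[idx])
        if min(1.0, p * n_features) < threshold:
            keep.append(idx)
    return keep


# Toy data: one list of sample values per feature (values are made up).
tumour = [[5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 4.8, 5.5],
          [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.4]]
normal = [[0.1, -0.1, 0.2, 0.0, -0.2, 0.3, 0.15, -0.3],
          [0.15, -0.25, 0.35, 0.05, -0.15, 0.25, -0.35, 0.45]]
selected = select_features(tumour, normal, threshold=0.05)
```

In the toy data only the first feature separates tumour from normal samples, so it is the only one retained.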

2.2. Copy Number Variation 

We use the publicly-available level 2 copy number 
data. We chose level 2 so that we had access to all 
probes (the level 3 data are segmented into regions 
which, in general, are different from sample to sample).
This gave 466 tumour and 376 normal samples generated
by MSKCC using the HG-CGH-244A platform. These were
read into a single data matrix and NaN (missing) values
set to zero.

For each probe, a Wilcoxon rank-sum test was used 
to determine whether or not there was differential 
copy number variation between the normal and tu- 
mour samples. A Bonferroni correction was then ap- 
plied to the p-values. A large number of probes had 
highly significant p-values, many of which contained 
similar information. It was therefore decided on prac- 
tical grounds to keep only the 1000 most significant 
probes as features for this analysis. 

2.3. Methylation 

We use the publicly-available level 3 methylation data. 
This gave 285 samples generated by JHU-USC on 
the HumanMethylation27 platform. The data were
in the form of beta values, which measure the ratio of
methylation signal to (methylation + background) signal.
For convenience, the data were binarised using a
threshold of $\beta > 0.95$. Features containing fewer
than 10 hits were then removed, leaving 769 features.
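The binarisation and filtering described above amount to two short steps, sketched here. The row-per-feature layout, the function names and the toy values are assumptions for illustration; only the 0.95 threshold and the minimum of 10 hits come from the text.

```python
def binarise(beta_rows, threshold=0.95):
    """Map each beta value to 1 if it exceeds the threshold, else 0."""
    return [[1 if beta > threshold else 0 for beta in row] for row in beta_rows]


def dense_features(binary_rows, min_hits=10):
    """Indices of feature rows with at least min_hits methylated samples."""
    return [i for i, row in enumerate(binary_rows) if sum(row) >= min_hits]


# Toy beta values: 3 features (rows) measured on 12 samples (columns).
toy_beta = [
    [0.99] * 11 + [0.20],         # 11 hits: kept
    [0.97] * 9 + [0.10] * 3,      # 9 hits: removed
    [0.96] * 10 + [0.95, 0.50],   # exactly 10 hits (0.95 itself is not a hit): kept
]
toy_binary = binarise(toy_beta)
kept = dense_features(toy_binary)
```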

2.4. MicroRNA

We use the publicly-available level 3 microRNA data.
This gave 490 tumour samples generated by UNC on
the H-miRNA_8x15K platform.



The data include 10 normal samples. For each mi- 
croRNA in turn, a Wilcoxon rank-sum test was used to 
determine whether or not there was differential expres- 
sion between tumour and normal samples. Applying 
a Bonferroni correction and then keeping only genes 
with a p-value of p < 1 x 10^"^ gave us 104 features. 

2.5. Clinical Follow-up 

The corresponding clinical data were also downloaded.
The files follow_up_v1.0_public_GBM.txt and
clinical_patient_public_GBM.txt were used, matching the
patients on the basis of the TCGA bar codes. We
note that 51 of the 277 samples did not have complete
clinical follow-up information.




Figure 1. Graphical representation of K = 3 DMA mixture
models. (a) Independent case. (b) Modelling dependence
between the latent component allocation variables
(the MDI model). Figure from Kirk et al. (2012).



3. Model 

We further develop the Multiple Data Integration 
(MDI) method of Kirk et al. (2012). MDI can be 
regarded as an extension of nonparametric Bayesian 
clustering methods to analyse simultaneously multiple 
data types, inferring the degree of similarity between 
the clustering structure in the data types. MDI hence 
produces clustering partitions for each data type, as 
well as an overall consensus clustering partition. 

We note that another particular strength of MDI is
that it infers the posterior distribution over the number
of clusters in each data type. Hence, we can infer
the likely number of clusters, given the data. Many
clustering algorithms do not provide a method for doing
this, which is a major shortcoming.

3.1. The MDI model 

We consider a multi-dimensional data set, consisting 
of K distinct data types (for example, gene expression, 
copy number, methylation or microRNA expression). 
Each data type will contain measurements for the same
set of items, so that for each item we have K vectors
of measurements. We note that in general the numbers
of features for each data type will be different from one 
another, and can be arbitrarily so for the MDI model. 

We model each data type using a finite approximation
to a Dirichlet process mixture model (Ishwaran
& Zarepour, 2002), known as a Dirichlet-Multinomial
Allocation (DMA) mixture model (Green & Richardson,
2001). The K DMA models are coupled by a set
of $\phi$ parameters that allow for information-sharing
between the data types and provide a measurement of
the level of similarity between pairs of types. Figure 1
shows a graphical representation of the MDI model.

The DMA mixture model for a single data type is given
by the following equation.

$$p(x) = \sum_{c=1}^{N} \pi_c f(x \mid \theta_c). \tag{1}$$



Here $\pi_c$ are the mixture proportions, $f$ is a
parametric density (such as a Gaussian) that models the
data in the c-th component, and $\theta_c$ are the
parameters associated with the c-th component. N is the
maximum number of mixture components, which is typically
set large enough not to affect the inference. When N
is large in this way, the behaviour of the DMA model
approaches that of a Dirichlet process. To ensure this,
and as a tradeoff with computational cost, we set
$N = [n/2]$ (where $n$ is the number of items)
throughout this paper.

We note that different choices for the density $f$ allow
us to model different types of data (for example, a 
normal distribution might be appropriate for contin- 
uous data, while a multinomial might be appropriate 
for categorical data). This imparts great flexibility to 
the MDI model. 

Given observed data $x_1, \ldots, x_n$, we wish to perform
Bayesian inference for the unknown parameters in this
model. We introduce latent component allocation variables
$c_i \in \{1, \ldots, N\}$, such that $c_i$ is the component
responsible for observation $x_i$. We then specify the
model as follows:

$$
\begin{aligned}
x_i \mid c_i, \theta &\sim F(\theta_{c_i}), \\
c_i \mid \pi &\sim \text{Multinomial}(\pi_1, \ldots, \pi_N), \\
\pi_1, \ldots, \pi_N &\sim \text{Dirichlet}(\alpha/N, \ldots, \alpha/N), \\
\theta_c &\sim G^{(0)},
\end{aligned}
\tag{2}
$$

where $F$ is the distribution corresponding to density
$f$, $\pi = (\pi_1, \ldots, \pi_N)$ is the collection of N
mixture proportions, $\alpha$ is a mass/concentration
parameter (which may also be inferred), and $G^{(0)}$ is
the prior for the component parameters.
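To make the allocation part of this generative model concrete, the sketch below draws the mixture proportions from a symmetric Dirichlet and then draws one component label per item. It is an illustration under stated assumptions rather than the MDI implementation: the function name, the toy sizes and the seed are hypothetical, and the Dirichlet draw uses the standard construction from normalised Gamma variates.

```python
import random


def sample_dma_allocations(n_items, n_components, alpha, rng):
    """Draw mixture proportions and component allocations from the DMA prior."""
    # Symmetric Dirichlet(alpha/N, ..., alpha/N) via normalised Gamma variates.
    gammas = [rng.gammavariate(alpha / n_components, 1.0) for _ in range(n_components)]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    # Each item picks a component with probability given by the proportions.
    allocations = rng.choices(range(n_components), weights=proportions, k=n_items)
    return proportions, allocations


# Toy run: 20 items and N = 10 components (half the number of items).
proportions, allocations = sample_dma_allocations(
    n_items=20, n_components=10, alpha=1.0, rng=random.Random(0)
)
n_occupied = len(set(allocations))
```

Because each Dirichlet parameter is small, most of the probability mass typically falls on a few components, so the number of occupied components is usually well below N.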

Now considering the full MDI model with K distinct
data types, the K mixture models are coupled by
introducing the following conditional prior for the
component allocation variables.

$$p(c_{i1}, \ldots, c_{iK} \mid \phi) \propto \prod_{k=1}^{K} \pi_{c_{ik} k} \prod_{k=1}^{K-1} \prod_{\ell=k+1}^{K} \left(1 + \phi_{k\ell} \, \mathbb{I}(c_{ik} = c_{i\ell})\right), \tag{3}$$

where $\mathbb{I}$ is the indicator function,
$\phi_{k\ell} \in \mathbb{R}_{\geq 0}$ is a parameter that
controls the strength of association between datasets
$k$ and $\ell$, and $\phi$ is the collection of all
$K(K-1)/2$ of the $\phi_{k\ell}$. $c_{ik} \in \{1, \ldots, N\}$
is the component allocation variable associated with
item $i$ in model $k$, and $\pi_{c_{ik} k}$ is the mixture
proportion associated with component $c_{ik}$ in model $k$.
Informally, the larger $\phi_{k\ell}$, the more likely it
is that $c_{ik}$ and $c_{i\ell}$ will take the same value,
and hence the greater the degree of similarity between
the clustering structure of dataset $k$ and dataset $\ell$.
If all $\phi_{k\ell} = 0$, then we recover the case of K
independent DMA mixture models (Figure 1a). We constrain
the $\phi_{k\ell}$ to be non-negative.

We note that the $x_{ij}$ are assumed to be independent,
given the clustering in MDI. The model then concentrates
on modeling the joint distribution of the allocation
variables $c_{i1}, \ldots, c_{iK}$, which induces
correlation over the x's.
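The effect of the coupling parameters can be seen by enumerating the conditional prior for a single item in a very small case. The sketch below assumes K = 2 data types, N = 2 components and made-up values for the mixture proportions and the coupling parameter; the function names are hypothetical and brute-force enumeration is only practical for tiny examples.

```python
from itertools import product


def allocation_weight(labels, proportions, phi):
    """Unnormalised prior weight of one item's labels (c_i1, ..., c_iK)."""
    weight = 1.0
    n_types = len(labels)
    for k in range(n_types):
        weight *= proportions[k][labels[k]]
    for k in range(n_types - 1):
        for l in range(k + 1, n_types):
            if labels[k] == labels[l]:
                weight *= 1.0 + phi[k][l]  # reward agreement between types k and l
    return weight


def allocation_prior(proportions, phi):
    """Normalised joint prior over all label combinations for a single item."""
    n_components = len(proportions[0])
    combos = list(product(range(n_components), repeat=len(proportions)))
    weights = [allocation_weight(c, proportions, phi) for c in combos]
    total = sum(weights)
    return {c: w / total for c, w in zip(combos, weights)}


# Toy case: equal mixture proportions in both data types.
pi = [[0.5, 0.5], [0.5, 0.5]]
coupled = allocation_prior(pi, [[0.0, 3.0], [3.0, 0.0]])
independent = allocation_prior(pi, [[0.0, 0.0], [0.0, 0.0]])
p_agree_coupled = coupled[(0, 0)] + coupled[(1, 1)]
p_agree_independent = independent[(0, 0)] + independent[(1, 1)]
```

With a coupling value of 3 the prior probability that the item receives the same label in both data types rises from 0.5 to 0.8, while a coupling value of 0 gives two independent allocations.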

3.2. Data Models 

For the analyses in this paper, we use two different
densities, $f_{\text{Gaussian}}$ and $f_{\text{multinomial}}$.
These are respectively used for real-valued and discrete
data and make reasonable assumptions (for the data used
in this paper) about the expected noise characteristics.

For both data models, we assume that the features
represent repeated, independent measurements of the
underlying clustering partition. We therefore have the
following equation.

$$f = \prod_{a} f_a, \tag{4}$$

where $f_a$ is the density for feature $a$.
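For the Gaussian density, we assume that each feature
is modeled by a Gaussian likelihood of unknown mean
and precision, subject to a Normal-Gamma prior.
The density therefore has a closed form and we hence
marginalise the mean and precision.

We therefore have the following (marginal) likelihood
function for the Gaussian case.

$$f_{a,\text{Gaussian}} = \frac{\Gamma(\alpha_n)}{\Gamma(\alpha_0)} \, \frac{\beta_0^{\alpha_0}}{\beta_n^{\alpha_n}} \left(\frac{\kappa_0}{\kappa_n}\right)^{1/2} (2\pi)^{-n/2}, \tag{5}$$

where

$$\mu \sim \mathcal{N}\left(0, (\kappa_0 \lambda)^{-1}\right), \tag{6}$$

$$\lambda \sim \text{Ga}(\alpha_0, \beta_0), \tag{7}$$

$$\kappa_n = \kappa_0 + n, \tag{8}$$

$$\alpha_n = \alpha_0 + \frac{n}{2}, \tag{9}$$

$$\beta_n = \beta_0 + \frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{\kappa_0 n \bar{x}^2}{2(\kappa_0 + n)}. \tag{10}$$

We set the Normal-Gamma hyperparameters to $\alpha_0 = 2$,
$\beta_0 = 0.5$ and $\kappa_0 = 0.001$.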
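As a numerical illustration of this kind of conjugate calculation, the sketch below evaluates the standard log marginal likelihood of a set of values under a Gaussian likelihood with a zero-mean Normal-Gamma prior, working in log space for stability. The function name and toy values are assumptions; the default hyperparameters are the ones quoted in the text. For a single observation the result can be checked against the Student-t density that this prior implies.

```python
import math


def log_marginal_gaussian(x, alpha0=2.0, beta0=0.5, kappa0=0.001):
    """Log marginal likelihood of values x under a zero-mean Normal-Gamma prior."""
    n = len(x)
    if n == 0:
        return 0.0  # no data: the marginal likelihood is 1
    mean = sum(x) / n
    scatter = sum((v - mean) ** 2 for v in x)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * scatter + kappa0 * n * mean ** 2 / (2.0 * kappa_n)
    return (
        math.lgamma(alpha_n) - math.lgamma(alpha0)
        + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
        + 0.5 * (math.log(kappa0) - math.log(kappa_n))
        - 0.5 * n * math.log(2.0 * math.pi)
    )


def log_student_t(x, dof, scale_sq):
    """Log density of a zero-centred Student-t with the given squared scale."""
    return (
        math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
        - 0.5 * math.log(dof * math.pi * scale_sq)
        - (dof + 1) / 2.0 * math.log(1.0 + x * x / (dof * scale_sq))
    )


# Single observation: the marginal is Student-t with 2*alpha0 degrees of
# freedom and squared scale beta0 * (kappa0 + 1) / (alpha0 * kappa0).
single = log_marginal_gaussian([1.3])
reference = log_student_t(1.3, dof=4.0, scale_sq=0.5 * 1.001 / (2.0 * 0.001))
tight = log_marginal_gaussian([0.1, 0.0, -0.1])
spread = log_marginal_gaussian([3.0, 0.0, -3.0])
```

Tightly grouped values receive a higher marginal likelihood than widely spread ones, which is what makes this quantity useful for scoring candidate clusters.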

For the multinomial density, we assume that each feature
is modeled by a multinomial likelihood, subject to
a Dirichlet prior. The parameters of the multinomial
likelihood are unknown, but because of the conjugate
prior the density has a closed form and those parameters
can hence be marginalised, leaving only the
hyperparameters $\beta_{rq}$ to be defined.

We therefore have the following (marginal) likelihood 
function for the multinomial case. 



$$f_{a,\text{multinomial}} = \prod_{q=1}^{Q} \frac{\Gamma(B_q)}{\Gamma(S_q + B_q)} \prod_{r} \frac{\Gamma(x_{rq} + \beta_{rq})}{\Gamma(\beta_{rq})} \tag{11}$$



We set the Dirichlet prior hyperparameters to
$\beta_{rq} = 0.5$.
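The conjugate marginalisation described above can be illustrated with the standard Dirichlet-multinomial result for a single feature within a single mixture component. The sketch below is an assumption-laden illustration, not the MDI code: it takes a vector of category counts, uses one symmetric hyperparameter value (0.5, as quoted in the text) for every category, and gives the probability of one particular sequence of observations, so it omits the multinomial coefficient.

```python
import math


def log_marginal_multinomial(counts, beta=0.5):
    """Log Dirichlet-multinomial marginal for the category counts of one feature."""
    total_count = sum(counts)
    total_beta = beta * len(counts)
    result = math.lgamma(total_beta) - math.lgamma(total_count + total_beta)
    for count in counts:
        result += math.lgamma(count + beta) - math.lgamma(beta)
    return result


# One binary observation: by symmetry each category has marginal probability 0.5.
single = math.exp(log_marginal_multinomial([1, 0]))
# Ten observations: consistent values score higher than an even split.
consistent = log_marginal_multinomial([10, 0])
mixed = log_marginal_multinomial([5, 5])
```

A component whose members mostly share the same value for a feature therefore contributes a larger marginal likelihood than one whose members are evenly split.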

3.3. Feature Selection 

Because of the potentially large number of features
in the various omic data types, we extend the Gaussian
and multinomial data models to include feature
selection. To do this, we include a binary indicator
parameter $I_a$ for each feature in a given data type.
This modifies Equation 4 to the following:

$$f = \prod_{a} \left[ I_a f_a + (1 - I_a) f_{a,\text{null}} \right]. \tag{12}$$

The factor for each feature will therefore be either
$f_a$ or $f_{a,\text{null}}$. The $f_a$ are as before, so
if all $I_a = 1$ then we have the model with no feature
selection.



The $f_{a,\text{null}}$ represent the alternative model
that the feature is uninformative and hence all items
are modeled as belonging to a single mixture component.
We also make an approximation that the likelihood
parameters are known, rather than marginalised over.
This makes only a modest correction to the typical
marginal likelihood values, but significantly speeds up
the computation of the conditional distributions for
Gibbs resampling.

For data models taking the form in Equation 4, we
note that the distributions for Gibbs resampling of the
$I_a$ are conditionally independent, given the $c_{ik}$.
As MDI is written in Matlab, this allows us to vectorise
the computation of the conditional distributions for
the Gibbs resampling of all the $I_a$. This vectorisation
makes the feature selection in MDI highly
computationally efficient and fast to execute.
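Because the indicators are conditionally independent, each one can be resampled from its own Bernoulli distribution. The sketch below shows one such Gibbs step in plain Python (the MDI code itself is in Matlab). It assumes that the per-feature log-likelihoods under the clustered and null models have already been computed, and the prior inclusion probability of 0.5, the function names and the toy numbers are all assumptions, not values from the paper.

```python
import math
import random


def selection_probabilities(log_lik_informative, log_lik_null, prior_on=0.5):
    """P(I_a = 1 | everything else) for every feature, one feature at a time."""
    probs = []
    for on, off in zip(log_lik_informative, log_lik_null):
        log_odds = math.log(prior_on) - math.log(1.0 - prior_on) + on - off
        # Logistic function written to avoid overflow for large |log_odds|.
        if log_odds >= 0:
            probs.append(1.0 / (1.0 + math.exp(-log_odds)))
        else:
            e = math.exp(log_odds)
            probs.append(e / (1.0 + e))
    return probs


def resample_indicators(probs, rng):
    """Draw each binary indicator I_a from its conditional Bernoulli distribution."""
    return [1 if rng.random() < p else 0 for p in probs]


# Toy log-likelihoods for three features under the two competing models.
informative = [-10.0, -20.0, -15.0]
null = [-20.0, -10.0, -15.0]
probs = selection_probabilities(informative, null)
indicators = resample_indicators(probs, random.Random(1))
```

Each probability depends only on that feature's own pair of log-likelihoods, which is why the whole update can be computed for all features at once in a vectorised language.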

3.4. Split-Merge MCMC sampling 

One characteristic of Gibbs samplers for mixture 
model clustering algorithms is that the MCMC chains 
can be relatively slow to mix. We have noticed this in 
particular for the number of occupied components. 

To improve this, we have implemented a version of the
sequential split-merge MCMC sampler of Dahl (2005).
The split-merge steps are applied separately to each
of the K DMA models. These MCMC steps are proposed
in addition to all the usual Gibbs steps described
in Kirk et al. (2012). The increase in computation
required for the split-merge steps is minor, and we find
that while the acceptance rate for the steps is low, the
overall effect is to substantially improve the mixing
rate for the number of clusters for each data type.



3.5. Extraction of Clustering Partitions



We adopt a different approach to that of Kirk 
et al. (2012) to the extraction of clustering parti- 
tions from the posterior similarity matrix. Because 
the previously-used method (Fritsch & Ickstadt, 2009) 
is only available as an R package, for convenience we 
implement as part of the MDI code a simpler method 
based on a hierarchical clustering using the posterior 
similarity matrix. We note that the results in Fritsch 
& Ickstadt (2009) show that this approach produces 
similar performance to the previously-used method. 

Using as distances (1 - posterior similarity), we per- 
form standard hierarchical clustering with complete 
linkage. We set the number of clusters to be the MAP 
estimate of the number of clusters, taken from the 
MCMC analysis. 
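The extraction procedure above can be sketched end to end on a toy set of MCMC samples. This is an illustrative standard-library version, not the MDI code: the function names and toy label vectors are hypothetical, the MAP number of clusters is taken to be the most frequently sampled number of occupied clusters, and ties between equally close cluster pairs are broken by index order.

```python
from collections import Counter


def posterior_similarity(samples):
    """Fraction of MCMC samples in which each pair of items shares a cluster label."""
    n = len(samples[0])
    psm = [[0.0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    psm[i][j] += 1.0 / len(samples)
    return psm


def map_cluster_count(samples):
    """Most frequently sampled number of occupied clusters."""
    return Counter(len(set(labels)) for labels in samples).most_common(1)[0][0]


def complete_linkage(psm, n_clusters):
    """Agglomerative clustering on distances 1 - psm until n_clusters remain."""
    clusters = [[i] for i in range(len(psm))]

    def distance(a, b):
        # Complete linkage: the largest pairwise distance between two clusters.
        return max(1.0 - psm[i][j] for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(distance(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(psm)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy MCMC output: 4 sampled label vectors for 5 items.
samples = [[0, 0, 1, 1, 1], [0, 0, 1, 1, 2], [2, 2, 0, 0, 0], [0, 0, 0, 1, 1]]
psm = posterior_similarity(samples)
n_map = map_cluster_count(samples)
partition = complete_linkage(psm, n_map)
```

Note that the similarity matrix is unaffected by label switching between samples, since it only records whether two items share a label.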

The resulting clustering partition should be regarded
as a convenient summary of the full results. We
strongly encourage users of this method not to neglect
other outputs such as the posterior similarity and
fusion matrices.

4. Results 

We analyse gene expression, copy number variation,
microRNA and methylation data for 277 glioblastoma
patients using MDI. We identify a range of distinct
disease subtypes and results, the most interesting of
which we now describe.

Complete results can be found at 
https://sites.google.com/site/multipledatafusion/ 

We consider 5 summary clustering partitions: one for
each of the 4 individual data types, and the consensus
of all 4. For each of these we consider 4 binary,
right-censored clinical outcomes: death, new tumour
event, tumour progression and tumour recurrence. This
gives us 20 cases in all.

For each case, we plot the Kaplan-Meier survival
curves for the set of disease subtypes. Table 1 shows
the log-rank p-values for these plots, after application
of a Bonferroni correction for multiple hypothesis
testing (nTests=20). Examples of these plots can be seen
in Figure 4. We note that in all cases we only consider
subtypes containing at least 5 items and we only
consider items for which there is clinical outcome
information.
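Two of the ingredients above are simple enough to sketch directly: the Bonferroni correction and the Kaplan-Meier (product-limit) estimate for right-censored outcomes. The function names and the toy follow-up data are assumptions for illustration, and the log-rank test itself is not reproduced here. As a worked example, the uncorrected p-value of 1.79e-05 shown in the title of Figure 4, multiplied by the 20 tests, gives 3.58e-04.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    times: follow-up time per patient; events: 1 if the outcome was observed,
    0 if the patient was right-censored at that time.
    """
    curve = []
    survival = 1.0
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


corrected = bonferroni(1.79e-05, 20)

# Toy follow-up data for 5 patients (made up): two are censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
```

Censored patients still count in the at-risk set up to their censoring time, which is why the curve drops by different amounts at each event time.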

4.1. Consensus subtypes are prognostic for 
tumour recurrence 

We note that the consensus subtypes in general are
strongly prognostic for tumour recurrence (log-rank
p-value of $3.6 \times 10^{-4}$ after correction for
multiple hypothesis tests) (see Figure 7). Because the
methylation status is measured at the point of diagnosis,
this prognostic capability is predictive.

4.2. Interesting low-methylation subtype 

Consensus subtype 7 has particularly interesting char- 
acteristics. Comprising 47 items, it shows only very 
low levels of tumour recurrence (see Figure 4) and no 
new tumour events. All items in this subtype show 
very low relative levels of methylation (see Figure 6). 
This subtype contains 23 women and 24 men, with a 
median age of 53 and an age range of 14 to 81. 



4.3. Gene expression subtypes are prognostic 
for survival outcome 

Figure 5 shows the Kaplan-Meier survival curves for 
the 8 gene expression (GE) subtypes identified by 
MDI. These subtypes are prognostic for survival out- 
come (log-rank p-value of 1.5 x 10^"^ after correction 
for multiple hypothesis tests). 

This result is largely driven by GE cluster 7, which 
consists of 6 patients with particularly poor survival 
outcome (see Figure 5). Of these 6 patients, 3 die 
within 6 months of diagnosis, and the other 3 are omit- 
ted from the survival analysis as we do not have infor- 
mation on their survival or otherwise (they are missing 
data). As such, this result relies on a small number of 
patients and should therefore be treated with caution. 
However, further study is certainly warranted in case 
this subtype remains distinct with a larger number of 
members. 

4.4. Partial overlap of clustering structure for 
different data types 

The fusion matrix (Figure 2) and consensus clustering
results show that there is a level of consistency
in the clustering structure across the gene expression,
copy number variation, methylation and microRNA
data. However, inspection of the clustering partitions
for each data type shows that there are also differences
in structure between the types. This indicates that a
single clustering partition is not sufficient to capture
all of the structure contained in the 4 data types.

4.5. Evidence for a biologically distinct 
glioblastoma subtype 

Consensus cluster 5 consists of 8 patients and is
noteworthy for highly distinctive data signatures in
gene expression (over-expression), copy number (excess
copies) and microRNA (over-expression in a subset
of selected features).

This subtype has poor tumour recurrence outcomes, and we
suggest that the striking data signatures are suggestive
of a distinctive set of biological mechanisms driving
this tumour subtype.

4.6. Fusion matrix makes biological sense 

The $\phi_{k\ell}$ parameters provide information on the
level of agreement about clustering structure between
pairs of data types. Figure 2 shows the posterior mean
values for the $\phi_{k\ell}$. The principal sharing of
structure is shown to be between gene expression and the
other three data types, while other pairs of data types
are less strongly related. This confirms what would be
expected from prior knowledge of the underlying
biological mechanisms. This is an important sanity check
of the analysis, and shows that there is useful
biological information in the data set.

Table 1. Bonferroni-corrected p-values for subtype Kaplan-Meier survival curves (nTests=20). For a given clinical outcome
and data type/s, the Kaplan-Meier curves for each disease subtype are produced. P-values are computed using the log-rank
test and considering the null hypothesis that all curves in a given set are drawn from the same underlying distribution.

data type/s        died    new event    progression    recurrence
all                1       0.66         0.11           $3.6 \times 10^{-4}$

copy number

gene expression
1.5 X 10-'^

methylation
0.28
$2.0 \times 10^{-3}$

microRNA

Figure 2. Matrix showing the posterior mean values for the
$\phi_{k\ell}$. Shown are the copy number data (row 1),
gene expression (row 2), methylation (row 3) and
microRNA (row 4). We note that the diagonal elements are
undefined for the MDI model and so are set to arbitrary
values so as to make the colour table convenient.

4.7. MCMC details 

The results presented in this paper are the result of 25
MCMC chains, each of approximately 70,000 samples. We
thin the chains by a factor of 10 and remove the first
25% of each chain as burn-in. We check for adequate
convergence by visual inspection of the MCMC time series
and histograms for each chain, overlaid on one another.
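The post-processing of each chain can be sketched as follows. The function names are hypothetical, and pooling the retained samples across chains is an assumption made for illustration (the text does not say how the 25 chains are combined); the thinning factor and burn-in fraction are the ones quoted above.

```python
def thin_and_burn(chain, thin=10, burn_fraction=0.25):
    """Keep every thin-th sample, then discard the leading burn-in fraction."""
    thinned = chain[::thin]
    n_burn = int(len(thinned) * burn_fraction)
    return thinned[n_burn:]


def pool_chains(chains, thin=10, burn_fraction=0.25):
    """Concatenate the retained samples from several independent chains."""
    pooled = []
    for chain in chains:
        pooled.extend(thin_and_burn(chain, thin, burn_fraction))
    return pooled


# Toy chains: sample indices stand in for the sampled model states.
toy_chain = list(range(70000))
retained = thin_and_burn(toy_chain)
pooled = pool_chains([toy_chain] * 25)
```

With these settings a chain of 70,000 samples leaves 5,250 retained samples, so 25 such chains would give 131,250 in total if pooled.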







Figure 3. Posterior similarity matrix for the 277 data 
items. This matrix gives the posterior probability of given 
pairs of items belonging to the same cluster, averaged over 
the 4 data types. This is used to produce the consen- 
sus clustering partition. The matrix shown here has been 
sorted by the resulting partition, to better show the struc- 
ture. The X-axis is labelled by alternating sets of bars/dots, 
denoting the different clusters. The LHS y-axis bars show 
the relative probability that a given item has the same 
cluster label across data types. 



5. Conclusions 

We have presented extensions to the MDI method 
that make it suitable for analysing multi-dimensional 
molecular cancer data sets. Using MDI to analyse the 
TCGA glioblastoma data, we have identified a number 
of important points. 
• Both the 8 methylation and 8 data-consensus disease
subtypes we have identified are significantly
prognostic of tumour recurrence

• The 8 gene expression disease subtypes we have
identified are significantly prognostic of survival
outcome

• We have identified a strongly prognostic glioblastoma
subtype, noteworthy for its low levels of both
tumour recurrence and methylation

• We have identified a small subtype of 6 patients
based on gene expression for which there is very
poor survival outcome, and postulate that this
may identify a rare and aggressive form of
glioblastoma

• We have also identified a subtype of 8 patients
with a highly distinctive data signature. This
subtype has poor tumour recurrence outcomes, and it
may represent a subtype whose underlying biology is
highly distinctive, which may allow for more
targeted therapy

• We note that the clustering structures for the 4
different data types overlap partially, but there is
significantly more structure than can be explained
by a single partition

Modern, large-scale cancer data sets contain a wealth 
of data types measuring the effects of genomic and 
environmental processes. We have demonstrated the 
effectiveness of combining these data types into a sin- 
gle analysis, and shown that the richness of structure 
contained by such multi-dimensional data sets neces- 
sitates statistical methods capable of capturing that 
richness. 
Figure 4. Kaplan-Meier survival curves plotting tumour recurrence for the consensus disease subtypes (plot title: Tumour
Recurrence, pValue=1.79e-05). The p-value is computed using a log-rank test. When quoted in Table 1, a Bonferroni
correction has been applied to account for multiple hypothesis tests.

Figure 5. Kaplan-Meier survival curves for the gene expression disease subtypes. The p-value is computed using a log-rank
test. When quoted in Table 1, a Bonferroni correction has been applied to account for multiple hypothesis tests.



Finding cancer subtypes in glioblastoma 



Figure 6. The mean number of methylated sites in each consensus subtype (out of a possible 769); plot title: Mean number
of methylated probes in the 'AM' clusters. A site is counted as methylated if it has been binarised to have unit value
in the input methylation data in this analysis.

Figure 7. Plot of the 4 data types. The x-axis gives the data items (patients/tumour samples), sorted by the consensus
clustering partition. The y-axis gives the selected features for each data type, with the dashed lines on the LHS indicating
P(selected) for each feature. We note that any feature with P(selected) < 0.1 is excluded from this plot, and also that the
outlying 5% of pixels in each data type are clipped for the purposes of plotting.

Enrichment Map of Hypergeometric tests on "GO.BiologicalProcess"
Enrichment Map of Hypergeometric tests on "GO.MolecularFunction"
Enrichment Map of Hypergeometric tests on "GO.CellularComponent"
Enrichment Map of Hypergeometric tests on